feat: Add Spark-compatible encode function to datafusion-spark #21331

Open
JeelRajodiya wants to merge 14 commits into apache:main from JeelRajodiya:feat/spark-encode-function

Conversation

@JeelRajodiya

@JeelRajodiya JeelRajodiya commented Apr 3, 2026

Contributor

Rationale

The datafusion-spark crate is missing the encode function. Spark's encode(expr, charset) converts a string or binary value into binary using a specified character encoding — commonly used in Spark SQL workloads and needed by engines built on DataFusion that target Spark compatibility.

What changes are included in this PR?

Adds SparkEncode to datafusion-spark's string functions, emulating Spark 3.5 semantics. It supports US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE, including common aliases (UTF8, LATIN1, etc.) and case-insensitive matching. The charset can be a constant or a per-row column. Binary input is decoded as lossy UTF-8 (invalid bytes → U+FFFD) before re-encoding, and unmappable characters are silently replaced with ?, matching Spark.
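
The charset handling described above (case-insensitive names, a few aliases, and lossy UTF-8 decoding of binary input) can be sketched roughly as follows. This is an illustration only, not the PR's code: the `Charset` enum, `parse_charset`, and `decode_binary_lossy` are hypothetical names, and the alias list is limited to the two aliases the description mentions.

```rust
/// Hypothetical charset enum for illustration; the PR's actual types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Charset {
    UsAscii,
    Iso88591,
    Utf8,
    Utf16,
    Utf16Be,
    Utf16Le,
    Utf32,
    Utf32Be,
    Utf32Le,
}

/// Resolve a charset name case-insensitively.
/// Only the aliases named in the description (UTF8, LATIN1) are assumed here.
pub fn parse_charset(name: &str) -> Result<Charset, String> {
    match name.to_ascii_uppercase().as_str() {
        "US-ASCII" => Ok(Charset::UsAscii),
        "ISO-8859-1" | "LATIN1" => Ok(Charset::Iso88591),
        "UTF-8" | "UTF8" => Ok(Charset::Utf8),
        "UTF-16" => Ok(Charset::Utf16),
        "UTF-16BE" => Ok(Charset::Utf16Be),
        "UTF-16LE" => Ok(Charset::Utf16Le),
        "UTF-32" => Ok(Charset::Utf32),
        "UTF-32BE" => Ok(Charset::Utf32Be),
        "UTF-32LE" => Ok(Charset::Utf32Le),
        other => Err(format!("unsupported charset: {other}")),
    }
}

/// Decode binary input as lossy UTF-8: each invalid sequence becomes U+FFFD.
pub fn decode_binary_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For example, `parse_charset("utf8")` and `parse_charset("UTF-8")` resolve to the same variant, while an unknown name yields an error, which corresponds to the unsupported-charset case mentioned in the tests section.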

Are these changes tested?

Yes. Coverage lives in encode.slt (sqllogictest) and exercises all charsets and aliases, case-insensitive matching, null value/charset handling, per-row charsets, binary input (Binary/LargeBinary/BinaryView) with lossy UTF-8, Utf8View input, and the unsupported-charset error. A Rust unit test covers return-field nullability.

Are there any user-facing changes?

New encode scalar function available when using datafusion-spark.

@github-actions github-actions Bot added the spark label Apr 3, 2026
@Zeel-e6x

Zeel-e6x commented Apr 3, 2026


run benchmarks

@adriangbot


Comment thread datafusion/spark/src/function/string/encode.rs Outdated
Comment thread datafusion/spark/src/function/string/encode.rs
Comment thread datafusion/spark/src/function/string/encode.rs

@xanderbailey xanderbailey left a comment

Contributor

Looks good to me but you’ll need a committer to Approve also! Thanks for the PR!

@JeelRajodiya

JeelRajodiya commented Apr 7, 2026

Contributor Author

Hey @xanderbailey, do I need to mention the maintainers to get a review? If so, please suggest whom to ask, in case you know.
I'm planning to open more PRs implementing other functions, but I'm waiting for this PR to be merged first.

@xanderbailey

Contributor

They will normally pick it up within a week or so. If not we can ping them here.

@alamb

alamb commented Apr 7, 2026

Contributor

Thanks @xanderbailey and @JeelRajodiya -- the PR load is pretty intense! I started the CI for this PR

@JeelRajodiya JeelRajodiya force-pushed the feat/spark-encode-function branch from 22a2705 to bf46433 Compare April 7, 2026 18:54
@JeelRajodiya

Contributor Author

I pushed the fixes for clippy errors. @alamb Can you rerun the checks please?

@JeelRajodiya

JeelRajodiya commented Apr 8, 2026

Contributor Author

Please rerun the checks

}
Ok(bytes)
}
_ => exec_err!(

Member

Spark also supports UTF-32. It would be worth adding a comment here explaining why this isn't or can't be supported.

  arguments = """
    Arguments:
      * str - a string expression
      * charset - one of the charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32' to encode `str` into a BINARY. It is case insensitive.
  """,

@JeelRajodiya JeelRajodiya Apr 15, 2026

Contributor Author

I missed adding support for UTF-32. I've added it now, with corresponding tests.

Comment thread datafusion/spark/src/function/string/encode.rs
@github-actions github-actions Bot added the core Core DataFusion crate label Apr 15, 2026
@JeelRajodiya JeelRajodiya force-pushed the feat/spark-encode-function branch from 55f4694 to 835ae8d Compare April 15, 2026 05:32
@github-actions github-actions Bot removed the core Core DataFusion crate label Apr 15, 2026
Implements `encode(string_or_binary, charset)` that converts a string
or binary value into binary using the specified character encoding,
matching Apache Spark's behavior.
In ANSI mode (default), encoding a character that cannot be represented
in the target charset (e.g. non-ASCII char in US-ASCII) returns an
error. In legacy mode, unmappable characters are silently replaced
with '?'.
@JeelRajodiya JeelRajodiya force-pushed the feat/spark-encode-function branch from 835ae8d to 6cb99a7 Compare April 15, 2026 05:40
@JeelRajodiya JeelRajodiya requested a review from andygrove April 15, 2026 05:43
@andygrove

Member

Thanks for iterating on this @JeelRajodiya. One issue I noticed:

spark-sql> SELECT hex(encode('A', 'UTF-32'));
00000041

This PR returns 0000FEFF00000041, with a BOM. Both Spark 3.5 and Spark 4.1 return the four-byte form.

One challenge for this PR is that the encode expression behaves differently across Spark versions. Which Spark version is this PR targeting? It would be good to document that.

@JeelRajodiya JeelRajodiya force-pushed the feat/spark-encode-function branch 3 times, most recently from dd0ad0e to 151ac23 Compare April 19, 2026 12:23
@JeelRajodiya

JeelRajodiya commented Apr 19, 2026

Contributor Author

Hey @andygrove, I realized that I shouldn't be using the enable_ansi_mode flag inside the encode function. In the Spark definition, ANSI mode is not bound to the encode function.

Moreover, we should target Spark 3.5, which is more permissive and doesn't return errors when unmappable characters are encountered; it simply replaces them with ?. I've added a TODO in the doc comment pointing at the two real Spark 4.1 configs so a follow-up PR can wire them up properly.

Below are the references to the spark definitions
Spark 3.5's Encode.scala:

protected override def nullSafeEval(input1: Any, input2: Any): Any = {
  input1.asInstanceOf[UTF8String].toString.getBytes(toCharset)
}

This just calls Java's String.getBytes, which replaces unmappable chars with the charset's default replacement byte (?). There is no legacyErrorAction, no config, and no exception.
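
A minimal Rust sketch of that replacement behavior for the single-byte charsets, assuming one `?` is emitted per unmappable code point. The function name `encode_single_byte` and its `max_code_point` parameter are hypothetical and are not taken from the PR.

```rust
/// Encode `s` into a single-byte charset whose highest mappable code point is
/// `max_code_point` (0x7F for US-ASCII, 0xFF for ISO-8859-1).
/// Code points above that limit are replaced with b'?'.
/// Hypothetical helper for illustration; callers must pass a limit <= 0xFF.
pub fn encode_single_byte(s: &str, max_code_point: u32) -> Vec<u8> {
    debug_assert!(max_code_point <= 0xFF);
    s.chars()
        .map(|c| {
            let cp = c as u32;
            if cp <= max_code_point {
                cp as u8 // fits in one byte because the limit is <= 0xFF
            } else {
                b'?' // unmappable: substitute the replacement byte
            }
        })
        .collect()
}
```

With this sketch, U+00E9 encodes to the byte 0xE9 under ISO-8859-1 but to `?` under US-ASCII, and no error is raised in either case.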

Spark 4.1's Encode.scala added two new configs for the strict behavior:

case class Encode(str, charset, legacyCharsets: Boolean, legacyErrorAction: Boolean)
  def this(value, charset) =
    this(value, charset, SQLConf.get.legacyJavaCharsets, SQLConf.get.legacyCodingErrorAction)

Setting legacyErrorAction=true restores the Spark 3.5 ? behavior.
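
One way a follow-up could model the strict versus legacy split is sketched below for US-ASCII. This is an assumption about a possible design, not code from the PR or from Spark: `encode_ascii` and its `legacy_error_action` parameter are hypothetical, with the parameter named after the Spark flag only for readability.

```rust
/// Encode `s` as US-ASCII.
/// With `legacy_error_action = true`, unmappable characters become b'?'
/// (the Spark 3.5 behavior described above); with `false`, the first
/// unmappable character produces an error instead.
/// Hypothetical sketch; not the PR's implementation.
pub fn encode_ascii(s: &str, legacy_error_action: bool) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(s.len());
    for c in s.chars() {
        if c.is_ascii() {
            out.push(c as u8);
        } else if legacy_error_action {
            out.push(b'?');
        } else {
            return Err(format!("cannot encode U+{:04X} as US-ASCII", c as u32));
        }
    }
    Ok(out)
}
```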

The spark.sql.legacy.javaCharsets and spark.sql.legacy.codingErrorAction configs are supported in Spark 4.1; wiring them up can be left for a future PR. The PR currently targets Spark 3.5, and I've mentioned this in the doc comment as well.

Let me know if we need to iterate on this further.

P.S. I've fixed the BOM issue.

Spark 3.5 and 4.1 both emit UTF-32 as UTF-32BE without a BOM. Our
previous implementation prepended a 0000FEFF BOM, which didn't match
any Spark version. Fix this so encode('A', 'UTF-32') produces
00000041 (4 bytes), matching Spark.

Also add a doc comment clarifying:
- Target Spark version (3.5 charset behavior, accepts aliases)
- UTF-32 semantics (alias for UTF-32BE)
- ANSI mode mapping to Spark 3.5 vs 4.0 unmappable-char behavior
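
The UTF-32 fix in this commit message amounts to emitting each code point as four big-endian bytes with nothing prepended. A minimal sketch under that reading follows; the name `encode_utf32_be` is hypothetical and the PR's actual code may be structured differently.

```rust
/// Encode `s` as UTF-32BE with no byte-order mark:
/// every Unicode scalar value becomes exactly four big-endian bytes.
/// Hypothetical helper for illustration only.
pub fn encode_utf32_be(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.chars().count() * 4);
    for c in s.chars() {
        out.extend_from_slice(&(c as u32).to_be_bytes());
    }
    out
}
```

For the input `A` this yields the bytes 00 00 00 41, matching the `00000041` output quoted from spark-sql in the review, rather than the eight-byte form with a leading 0000FEFF.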
@JeelRajodiya JeelRajodiya force-pushed the feat/spark-encode-function branch from 151ac23 to 701850d Compare April 19, 2026 12:29
@JeelRajodiya

Contributor Author

@andygrove Can you review the PR?

Comment thread datafusion/spark/src/function/string/encode.rs Outdated
Comment thread datafusion/spark/src/function/string/encode.rs Outdated
Comment thread datafusion/spark/src/function/string/encode.rs Outdated
Comment thread datafusion/spark/src/function/string/encode.rs Outdated
Comment thread datafusion/spark/src/function/string/encode.rs Outdated
@JeelRajodiya JeelRajodiya requested a review from Jefffrey June 29, 2026 11:17
Labels

spark sqllogictest SQL Logic Tests (.slt)

7 participants